16 research outputs found
The structure of verbal sequences analyzed with unsupervised learning techniques
Data mining allows the exploration of sequences of phenomena, whereas one
usually tends to focus on isolated phenomena or on the relation between two
phenomena. It offers invaluable tools for theoretical analyses and exploration
of the structure of sentences, texts, dialogues, and speech. We report here the
results of an attempt at using it for inspecting sequences of verbs from French
accounts of road accidents. This analysis relies on an original unsupervised
learning approach that allows the discovery of the structure of sequential
data. The input to the analyzer consisted solely of the verbs appearing in the
sentences. It provided a classification of the links between two successive
verbs into four distinct clusters, thus allowing text segmentation. We give
here an interpretation of these clusters by applying a statistical analysis to
independent semantic annotations.
Model-based Co-clustering for High Dimensional Sparse Data
We propose a novel model based on the von Mises-Fisher (vMF) distribution for co-clustering high dimensional sparse matrices. While existing vMF-based models are only suitable for clustering along one dimension, our model acts simultaneously on both dimensions of a data matrix. It thereby has the advantage of exploiting the inherent duality between rows and columns. Setting our model under the maximum likelihood (ML) approach and the classification ML (CML) approach, we derive two novel co-clustering algorithms, one hard and one soft. Empirical results on numerous synthetic and real-world text datasets demonstrate the effectiveness of our approach for modelling and co-clustering high dimensional sparse data. Furthermore, because our formulation performs an implicitly adaptive dimensionality reduction at each stage, our model alleviates the problem of high concentration parameters (kappa), a well-known difficulty in classical vMF-based models.
Enchaînements verbaux - étude sur le temps et l'aspect utilisant des techniques d'apprentissage non supervisé (Verb sequences: a study of tense and aspect using unsupervised learning techniques)
Unsupervised learning allows the discovery of initially unknown categories. Current techniques make it possible to explore sequences of phenomena, whereas one tends to focus on the analysis of isolated phenomena or on the relation between two phenomena. They thus offer invaluable tools for the analysis of sequential data and, in particular, for the discovery of textual structures. We report here the results of a first attempt at using them to inspect sequences of verbs taken from sentences of French accounts of road accidents. Verbs were encoded as pairs (cat, tense), where cat is the aspectual category of a verb and tense its grammatical tense. The analysis, based on an original approach, provided a classification of the links between two successive verbs into four distinct groups (clusters), allowing text segmentation. We give here an interpretation of these clusters by using statistics on semantic annotations independent of the training process.
Hybrid Unsupervised Learning to Uncover Discourse Structure
Volume of the best papers of LTC'07. Data mining allows the exploration of sequences of phenomena, whereas one usually tends to focus on isolated phenomena or on the relation between two phenomena. It offers invaluable tools for theoretical analyses and exploration of the structure of sentences, texts, dialogues, and speech. We report here the results of an attempt at using it to inspect sequences of verbs from French accounts of road accidents. This analysis relies on an original unsupervised learning approach that allows the discovery of the structure of sequential data. The input to the analyzer consisted solely of the verbs appearing in the sentences. It provided a classification of the links between two successive verbs into four distinct clusters, thus allowing text segmentation. We give here an interpretation of these clusters by comparing the statistical distribution of independent semantic annotations.
Stochastic Co-clustering for Document-Term Data
Co-clustering is more useful than one-sided clustering when dealing with high dimensional sparse data. We propose to address document clustering with a generative model-based co-clustering approach. To this end, we rely on a particular mixture of von Mises-Fisher distributions and propose a new parsimonious model that reveals a block diagonal structure as well as a good partitioning of documents and terms. Then, by estimating the model parameters under the maximum likelihood (ML) approach, we derive three novel co-clustering algorithms: a soft one and two stochastic variants. Empirical results on numerous simulated and real-world datasets demonstrate the advantages of our approach for modelling and co-clustering high dimensional sparse data.
An Efficient Incremental Collaborative Filtering System
Collaborative filtering (CF) systems aim at recommending a set of personalized items to an active user, according to the preferences of other similar users. Many methods have been developed, and some, such as those based on similarity and Matrix Factorization (MF), can achieve very good recommendation accuracy, but unfortunately they are computationally prohibitive. Thus, applying such approaches to real-world applications in which the available information evolves frequently is a non-trivial task. To address this problem, we propose a novel, efficient incremental CF system based on a weighted clustering approach. Our system is able to provide high-quality recommendations at a very low computation cost. Experimental results on several real-world datasets confirm the efficiency and the effectiveness of our method by demonstrating that it is significantly better than existing incremental CF methods in terms of both scalability and recommendation quality.
Sequencing of verbs - a study on tense and aspect using unsupervised learning
We report here the results of an attempt at using data mining tools to inspect sequences of verbs from French accounts of road accidents. This analysis relies on an original unsupervised learning approach that allows the discovery of the structure of sequential data. The input to the analyzer consisted solely of the verbs appearing in the sentences. It provided a classification of the links between two successive verbs into four distinct clusters, thus allowing text segmentation. We give here an interpretation of these clusters by applying a statistical analysis to independent semantic annotations.
Apprentissage neuro-markovien pour la classification non supervisée de données structurées en séquences (Neuro-Markovian learning for the unsupervised classification of sequence-structured data)